ABSTRACT

Many online networks are not fully known and are often 
studied via sampling. Random Walk (RW) based techniques 
are the current state-of-the-art for estimating nodal attributes 
and local graph properties, but estimating global properties 
remains a challenge. In this paper, we are interested in a fun- 
damental property of this type — the graph size N, i.e., the 
number of its nodes. Existing methods for estimating N
(i) are inefficient and (ii) cannot be easily used with RW
sampling due to dependence between successive samples. In
this paper, we address both problems.

First, we propose IE (Induced Edges), an efficient technique
for estimating N from an independence sample of the
graph's nodes. IE exploits the edges induced on the sampled
nodes. Second, we introduce SafetyMargin, a method
that corrects estimators for dependence in RW samples. Fi- 
nally, we combine these two stand-alone techniques to ob- 
tain a RW-based graph size estimator. We evaluate our ap- 
proach in simulations on a wide range of real-life topologies, 
and on several samples of Facebook. IE with SafetyMar- 
gin typically requires at least 10 times fewer samples than 
the state-of-the-art techniques (over 100 times in the case of 
Facebook) for the same estimation error.

1. INTRODUCTION

An important and fundamental graph property is its
size, i.e., the number of nodes N. This property has
practical as well as theoretical importance. For example,
consider the market value (e.g., stock price) of
an online social network (OSN) service provider such
as Facebook. Among the criteria analysts would consider
when valuing such a firm are its current number of
users N and its growth rate (i.e., new users per month).
These numbers are critical not only for investors, but
also for various business decisions, such as choosing the
medium for an advertising campaign, or for launching a
social application.

In some cases, OSN providers officially publish the
total number of users. However, these numbers may be
(i) outdated, (ii) incorrect (there exist strong incentives
to report large N), or (iii) difficult to compare across
networks (e.g., Facebook publishes the number of its
active users, which is different from the total published
by its competitors).

In other cases, the graph size may not be available at
all. For example, in distributed computer systems, the
entities (e.g., nodes in a P2P network) have only a local
view of the system (a list of neighbors). While this
often leads to better scalability and reliability, it makes
it much harder to obtain system parameters that are
trivially known in a centralized architecture. One of
them is the system size N, a common input parameter
in various distributed protocols, such as overlay
maintenance [34] or routing [8].

N is also typically unknown in the study of online media.
The WWW, blogging platforms, instant messaging and
OSNs are all rich in information content contributed by
millions of individuals and organizations. Knowledge
of the structure and the processes in these information
networks can be used to track the spread
of memes (news, topics, ideas, URLs) [10,28], predict
the outcome of presidential elections [1,2], or improve
a marketing campaign [7]. Unfortunately, complete social
media data is often impossible to collect, and the
results obtained on incomplete datasets are potentially
biased [5]. To assess the completeness of collected data
(and thus the extent of this bias), one can compare the
size of the sampled part with the estimated total size N
of the information network.

Finally, estimating the size of hidden populations such 
as drug users or HIV positives is a major challenge in 
the social sciences. One of the main sampling tech- 
niques currently used in this context is a variation of 
RW called Respondent-Driven Sampling (RDS) [13,17, 
41,45]. "RDS is now widely used in the public health 
community and has been recently applied in more than 
120 studies in more than 20 countries, involving a total 
of more than 32000 participants" [13]. Because these 
field studies have a significant monetary cost, any
improvement in measurement efficiency directly leads to
concrete budget savings.

In all these cases, it is highly desirable to have an efficient
way of estimating the size of a graph based on
sampled data. One of the most popular sampling techniques,
and often the only one feasible in practice (see Footnote 2),
is Random Walk (RW). RW-based sampling
has been used to sample the WWW [18], P2P networks
[12,41,48], OSNs [11,21,38,42], and "offline" social
networks [13,17,41,45]. In this paper, we derive
efficient and practical RW-based estimators
of graph size N. Our contributions are the following.

First, in Sec. 4 we propose IE (for Induced Edges),
a family of efficient techniques to estimate N based on
an independence sample (uniform or not) of the graph's nodes.
IE exploits the number of edges induced on the sampled
nodes (see Fig. 1(b)), which is fundamentally different
from the state-of-the-art techniques that exploit
the node repetitions within the sample (see Fig. 1(a)
and Sec. 3).

Second, in Sec. 5 we extend IE to accept RW samples.
Here, the main challenge is that consecutive
RW samples are strongly dependent, which critically
impacts the estimation results. We address this
problem by introducing several RW dependence reduction
techniques, including SafetyMargin, our best performer.
SafetyMargin is a stand-alone technique that
can be applied in other RW-based estimation problems
as well.

Third, in Sec. 6 we discuss the practical implementation
issues related to our estimators, and make our
efficient Python implementation available at [23].

Fourth, in Sec. 7 we evaluate our approach in simulations
on a wide range of real-life topologies and on
several samples of Facebook (confirming the officially 
announced numbers). Compared to the state-of-the- 
art solutions, we typically observe several-fold gain in 
sampling cost for each IE and SafetyMargin, separately. 
When combined together, IE with SafetyMargin usually 
requires at least 10 times fewer samples versus standard 
methods (over 100 times in the case of Facebook) for the 
same estimation error. 

2. NOTATION 

Let $G = (V, E)$ be an undirected, connected graph.
Let graph size $N = |V|$ be the number of nodes in G.
Our goal in this paper is to estimate N (see Footnote 1) based on a
sample $S = [s_1, s_2, \ldots, s_n]$ of $n = |S|$ nodes, with
replacements. For every sampled node $s \in S$, we know the
list of its neighbors $\mathcal{N}(s)$. Depending on the way S is
collected, we distinguish the following sampling techniques:

• Uniform Independence Sample (UIS): The nodes
in S are collected independently, uniformly at random,
with replacement (see Footnote 2).

• Weighted Independence Sample (WIS): Every node
$u \in S$ has a sampling probability proportional to
its weight w(u). The nodes are sampled with replacement.

• Random Walk (RW): At every iteration, the next
node is chosen uniformly at random from all neighbors
of the current node (see Footnote 3).

^1 If needed, given the estimated N, one can easily estimate
the number of edges as $|E| = N \cdot \langle k \rangle / 2$, where the average
node degree $\langle k \rangle$ can be calculated as in Eq.(9) or Eq.(12).

[Figure 1 image: panels (a) NODE (state of the art) and (b) IE (this paper); legend: sampled nodes, unsampled nodes, induced edges (IE).]

Figure 1: Two families of techniques to estimate
the graph size N based on a (uniform
or non-uniform) sample of nodes, with replacements.
(a) NODE techniques (state of the
art) exploit node collisions (repetitions). Here,
among $n = 11$ sampled nodes, we observe $n^{uniq} = 8$
unique nodes, and $n^{col} = 4$ node collisions; NODE
uses these numbers to estimate N. (b) IE
techniques (introduced in this paper) exploit the
edges induced on the sampled nodes (with repetitions).
Here, we have $n^{IE} = 14$ such edges.

3. RELATED WORK (NODE) 

Population size estimation has a long history. Most 
of the existing size estimation techniques are based on 
node repetitions in the collected sample S. We refer to 
these techniques collectively as NODE, and illustrate
them in Fig. 1(a).

^2 Collecting a UIS node sample would be a trivial task if we
had a list of all nodes in the graph. But then, of course,
the graph's size needs no estimation: we know it precisely.
Alternatively, a UIS sample can sometimes be obtained by
rejection sampling of the userID space [11]. However, an overly
large userID space, such as the 64-bit space currently used by
Facebook, makes this approach completely impractical (unless
some additional features can be exploited, as in [43,53]).
^3 Although all our results also apply directly to Weighted
Random Walks [24], for notation simplicity we limit the
presentation to simple (unweighted) RWs.

3.1 UIS - Uniform Independence Sample

Under UIS, there are several existing approaches to
estimating the population size. The most prominent are
the following.

3.1.1 Capture-Recapture 

In this classic method [14,46,47], we independently
collect two uniform samples $S_1^{uniq}$ and $S_2^{uniq}$, without
replacement. The population size can then be estimated
by (or by variations of):

\[ \hat{N} = \frac{|S_1^{uniq}| \cdot |S_2^{uniq}|}{|S_1^{uniq} \cap S_2^{uniq}|}. \qquad (1) \]

We can apply Eq.(1) to a UIS sample S by randomly
splitting S into two equal-sized subsamples $S_1$ and $S_2$,
and then discarding the repetitions within each of them
to obtain $S_1^{uniq}$ and $S_2^{uniq}$.

Note that in this last step, we discard potentially
valuable information, which may limit the performance
of the estimator Eq.(1). Below, we present techniques
better suited to a UIS sample.
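
As a rough illustration of the split-and-deduplicate procedure described above, the following Python sketch applies the capture-recapture ratio to a single UIS sample. The function name `capture_recapture` and the choice to return infinity when the two halves share no node are assumptions made for this example; they are not taken from the paper's implementation [23].

```python
import random


def capture_recapture(sample, rng=None):
    """Capture-recapture estimate from one UIS sample (hypothetical helper).

    The sample is shuffled, split into two halves, and repetitions inside
    each half are discarded before the two halves are compared.
    """
    rng = rng or random.Random(0)
    shuffled = list(sample)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    s1 = set(shuffled[:half])   # unique nodes in the first half
    s2 = set(shuffled[half:])   # unique nodes in the second half
    overlap = len(s1 & s2)
    if overlap == 0:
        # No recaptured node: the ratio is undefined; we report infinity.
        return float("inf")
    return len(s1) * len(s2) / overlap
```

For instance, `capture_recapture([rng.randrange(1000) for _ in range(2000)])` with some `random.Random` instance `rng` should land near 1000, while a sample without any shared node between the halves yields infinity.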

3.1.2 Unique Element Counting 

Population size estimation can be mapped to the problem
of estimating the number of species in biology, where
every node in V is a separate species (see [4] for a good
review). [14], page 73, uses maximum likelihood estimation
(MLE) to derive the following approximation:

\[ n^{uniq} \approx N \left(1 - e^{-n/N}\right), \qquad (2) \]

where $n^{uniq}$ is the number of seen species. In our context,
$n^{uniq}$ is the number of unique nodes in S. For
example, in Fig. 1(a), $n^{uniq} = 8$. Because N is the only
unknown, we can solve Eq.(2) and obtain an estimate $\hat{N}$
of the population size N.

An exact version of this MLE estimator was first
given in [6], defining $\hat{N}_{UIS}$ as the smallest integer
$N \geq n^{uniq}$ that satisfies

\[ \frac{N+1}{N+1-n^{uniq}} \left(\frac{N}{N+1}\right)^{n} < 1. \qquad (3) \]

Its uniqueness was shown in [9]. [52] proposed an efficient
method of evaluating it.
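
The two unique-element estimators can be sketched in Python as follows. This is a minimal illustration, assuming that the approximation reads n_uniq = N(1 - exp(-n/N)) and that the exact MLE is the smallest integer N >= n_uniq for which the likelihood ratio L(N+1)/L(N) drops below 1. The function names, the bisection solver and the `max_n` cap are choices made for this example, not the efficient evaluation method of [52].

```python
import math


def solve_unique_approx(n, n_uniq, tol=1e-9):
    """Solve n_uniq = N * (1 - exp(-n / N)) for N by bisection."""
    if n_uniq <= 0:
        raise ValueError("need at least one observed node")
    if n_uniq >= n:
        return float("inf")  # no repetitions: no finite solution

    def f(big_n):
        return big_n * (1.0 - math.exp(-n / big_n)) - n_uniq

    lo, hi = float(n_uniq), 2.0 * n_uniq
    while f(hi) < 0:          # f is increasing in N, so expand until it is bracketed
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exact_unique_mle(n, n_uniq, max_n=10**7):
    """Smallest integer N >= n_uniq with likelihood ratio L(N+1)/L(N) < 1."""
    if n_uniq >= n:
        return float("inf")
    big_n = n_uniq
    while big_n <= max_n:
        ratio = (big_n + 1) / (big_n + 1 - n_uniq) * (big_n / (big_n + 1)) ** n
        if ratio < 1:
            return big_n
        big_n += 1
    return float("inf")  # search cap reached (assumption of this sketch)
```

With the numbers of Fig. 1(a), one would call `solve_unique_approx(11, 8)` and `exact_unique_mle(11, 8)`; the linear scan in the second function is only meant for small examples.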

3.1.3 Collision Counting 

Another approach is to study the number $n^{col}$ of collisions
in the sample [3,19,36]. A collision is a pair of
identical samples. More precisely (see Footnote 4),

\[ n^{col} = \sum_{i<j} 1_{\{s_i = s_j\}}. \qquad (4) \]

For example, in Fig. 1(a), $n^{col} = 4$. Note that we usually
have $n^{col} + n^{uniq} \neq n$. We can now estimate the
population size N by [19]

\[ \hat{N} = \binom{n}{2} \frac{1}{n^{col}}. \qquad (5) \]

^4 In the sum in Eq.(4) and in many equations that follow,
indexes i, j run from 1 to n, i.e., across the entire sample S.



3.2 WIS - Weighted Independence Sample 

[19] provides an elegant extension of the estimator
Eq.(5) to cover WIS (see Footnote 5), as follows

\[ \hat{N} = \frac{1}{2\, n^{col}} \sum_{i \neq j} \frac{w(s_i)}{w(s_j)}. \qquad (6) \]

Under UIS, Eq.(6) reduces to Eq.(5). However, it was
shown in [19] that Eq.(6) under WIS typically outperforms
Eq.(6) under UIS.
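
The collision-based (NODE) estimators can be sketched as below. The sketch assumes the UIS estimator has the birthday-paradox form C(n,2) / n_col and that the WIS extension divides the sum of w(s_i)/w(s_j) over all ordered pairs i != j by twice the number of collisions; both forms are reconstructions, and the function names are hypothetical.

```python
from collections import Counter


def count_collisions(sample):
    """Number of pairs i < j with s_i == s_j."""
    return sum(c * (c - 1) // 2 for c in Counter(sample).values())


def node_uis(sample):
    """Collision-based size estimate under UIS (assumed form C(n,2) / n_col)."""
    n = len(sample)
    n_col = count_collisions(sample)
    if n_col == 0:
        return float("inf")
    return n * (n - 1) / 2 / n_col


def node_wis(sample, weight):
    """Weighted variant: sum over i != j of w(s_i)/w(s_j), divided by 2 * n_col."""
    n_col = count_collisions(sample)
    if n_col == 0:
        return float("inf")
    sum_w = sum(weight[s] for s in sample)
    sum_inv_w = sum(1.0 / weight[s] for s in sample)
    # The full double sum equals sum_w * sum_inv_w; the n terms with i == j
    # each contribute exactly 1 and are removed.
    pair_sum = sum_w * sum_inv_w - len(sample)
    return pair_sum / (2 * n_col)
```

A sample with the multiplicities of Fig. 1(a), e.g. `list("aaabbcdefgh")`, has 11 samples, 8 unique nodes and 4 collisions; with all weights equal to 1 both functions return the same value.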

3.3 Other Approaches 

Various other approaches to the size estimation prob- 
lem exist, but are not presented here because they: 
(i) depend on special features of the network or service 
being studied, and are not broadly applicable; (ii) de- 
pend on case-specific knowledge of the network or ser- 
vice (again, limiting applicability); and/or (iii) are less 
efficient than the approaches described above. Among 
the more prominent of these are the following. 

Random Walk (RW) Tours. 

This is a family of techniques [35,36,52] that perform
RWs until they return to the starting node. However,
these approaches are very inefficient, as they often require
a sample comparable with the graph size [19,36,52],
and therefore we do not consider them in this paper.

Traceroute. 

The Internet at the IP layer gives us another widely 
used sampling method, traceroute, which can be in- 
terpreted as a rough approximation of a shortest path 
between two nodes. [20,50] propose two graph size esti- 
mators that take as input a traceroute sample. 

Model-based Estimation. 

Finally, one can assume something about the distri- 
bution of involved variables, which leads to a model- 
based estimation. Recently, [39] used such an approach 
to estimate the number of Bluetooth devices in an en- 
closed area. 

4. INDUCED EDGE (IE) TECHNIQUES 

In this paper, we take an approach that is fundamentally
different from the state-of-the-art NODE family
of techniques described in Sec. 3 and Fig. 1(a). We
consider not only the sampled set S of nodes, but also
their sampled neighbors (see Footnote 6). In other words, we study the
edges of G induced on S, as illustrated in Fig. 1(b). We
therefore refer to this family of techniques collectively
as IE (Induced Edges). Under IE, we observe an edge
$\{u, v\}$ only when $u, v \in S$, i.e., when both its end-nodes
are sampled. Let $n^{IE}$ be the number of such edges, i.e.,

\[ n^{IE} = \sum_{i<j} 1_{\{\{s_i, s_j\} \in E\}}. \qquad (7) \]

Note that if nodes are repeated in S, we count every
occurrence of each node separately. For example, in
Fig. 1(b), $n^{IE} = 14$.

^5 Strictly speaking, [19] considers only the case with node degrees
serving as node weights, i.e., with $w(v) = \deg(v)$. The
more general version given here trivially follows from [19].

Intuitively, IE has large potential, especially in dense
graphs. Indeed, in a particular iteration of UIS, node v
is re-sampled with probability equal to $1/N$, but one
of v's neighbors is sampled with probability $\deg(v)/N$.
Consequently, IE observes $\langle k \rangle$ (average node degree)
times more collisions than NODE, which, for typical
online graphs such as Facebook with $\langle k \rangle = 150+$, provides
far more information to exploit.

Below, we develop two approaches that exploit IE:
IE1 and IE2. In their basic forms, they accept as input
independence node samples (UIS or WIS); we extend
them to RW samples in Sec. 5.

4.1 IE1: Graph Density

We will exploit the following graph identity [15]

\[ N = \frac{\langle k \rangle}{\rho} + 1, \qquad (8) \]

where $\rho = \frac{2|E|}{N(N-1)}$ is the graph density, and
$\langle k \rangle = \frac{2|E|}{N}$ is the average node degree.

4.1.1 UIS

Under UIS, the average node degree $\langle k \rangle$ can be easily
estimated from our sample S as

\[ \hat{\langle k \rangle} = \frac{1}{n} \sum_{s \in S} \deg(s). \qquad (9) \]

The graph density $\rho$ can be interpreted as the probability
that two different nodes u and v, $u \neq v$, chosen
uniformly at random, are adjacent. This can be estimated
by inspecting all pairs of different nodes in our
sample S, and counting the fraction of them that actually
forms edges, i.e.,

\[ \hat{\rho} = \frac{\sum_{i<j} 1_{\{\{s_i, s_j\} \in E\}}}{n(n-1)/2}. \qquad (10) \]

^6 In theory, one could also consider non-sampled neighbors,
which is sometimes referred to as star [20] or social [22]
sampling. However, except for some special cases [24,25],
the resulting estimators would require the knowledge of the
sampling weights (degrees) of non-sampled nodes [22], which
is rarely available in practice.

By plugging Eq.(9) and Eq.(10) in Eq.(8), we then obtain
the size estimator

\[ \hat{N}_{UIS} = \frac{(n-1) \cdot \sum_{s \in S} \deg(s)}{2 \cdot n^{IE}} + 1, \qquad (11) \]

where $n^{IE}$ is the number of induced edges, i.e., the numerator of Eq.(10).
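
A minimal Python sketch of the density-based estimator under UIS is given below. It assumes the graph is available as a dictionary `adj` mapping each node to the set of its neighbors, and that the estimator is obtained by plugging the sample mean degree and the sample density into N = <k>/rho + 1. The helper names are hypothetical and the pair loop is the naive quadratic version.

```python
def induced_edge_count(sample, adj):
    """Pairs i < j whose nodes are adjacent; repeated nodes count separately."""
    n = len(sample)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if sample[j] in adj[sample[i]]
    )


def ie1_uis(sample, adj):
    """Density-based size estimate under UIS: mean degree / density + 1."""
    n = len(sample)
    n_ie = induced_edge_count(sample, adj)
    if n_ie == 0:
        return float("inf")  # no induced edge observed yet
    k_hat = sum(len(adj[s]) for s in sample) / n    # estimated mean degree
    rho_hat = n_ie / (n * (n - 1) / 2)              # estimated density
    return k_hat / rho_hat + 1
```

On a complete graph with 5 nodes where every node is sampled once, the estimated density is 1 and the mean degree is 4, so the sketch returns 5.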



4.1.2 WIS 



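Under WIS, node v is assigned a sampling weight
w(v), which creates a linear bias towards nodes with
higher weights. We can correct for this bias by applying
the Hansen-Hurwitz technique [11,16], which consists
of dividing by w(s) every term related to $s \in S$.
Consequently, the corrected version of Eq.(9) becomes
[11,20,41]:

\[ \hat{\langle k \rangle} = \frac{\sum_{s \in S} \frac{\deg(s)}{w(s)}}{\sum_{s \in S} \frac{1}{w(s)}}. \qquad (12) \]

Similarly, we apply a two-point correction [20] to node
pairs in Eq.(10), to obtain the density estimator

\[ \hat{\rho} = \frac{\sum_{i<j} \frac{1_{\{\{s_i, s_j\} \in E\}}}{w(s_i) w(s_j)}}{\sum_{i<j} \frac{1}{w(s_i) w(s_j)}}. \qquad (13) \]

Finally, by plugging Eq.(12) and Eq.(13) in Eq.(8), we
obtain the size estimator

\[ \hat{N}_{WIS} = \frac{\sum_{s \in S} \frac{\deg(s)}{w(s)}}{\sum_{s \in S} \frac{1}{w(s)}} \cdot \frac{\sum_{i<j} \frac{1}{w(s_i) w(s_j)}}{\sum_{i<j} \frac{1_{\{\{s_i, s_j\} \in E\}}}{w(s_i) w(s_j)}} + 1. \qquad (14) \]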
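
The weighted variant can be sketched in the same style. The code below assumes the Hansen-Hurwitz corrected mean degree (each term divided by w(s)) and the two-point corrected density (each pair divided by w(s_i) w(s_j)), combined through N = <k>/rho + 1. The names `ie1_wis`, `adj` and `weight` are hypothetical, and the quadratic loop is kept for clarity.

```python
def ie1_wis(sample, adj, weight):
    """Density-based size estimate under WIS (naive O(n^2) sketch)."""
    n = len(sample)
    inv_w = [1.0 / weight[s] for s in sample]
    # One-point (Hansen-Hurwitz) correction of the mean degree.
    k_hat = sum(len(adj[s]) * iw for s, iw in zip(sample, inv_w)) / sum(inv_w)
    # Two-point correction of the density.
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            pair_weight = inv_w[i] * inv_w[j]
            den += pair_weight
            if sample[j] in adj[sample[i]]:
                num += pair_weight
    if num == 0:
        return float("inf")
    rho_hat = num / den
    return k_hat / rho_hat + 1
```

With all weights equal the result coincides with the unweighted density-based estimate, which gives a quick sanity check.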



4.2 IE2: Arbitrary A and Sample S 

In this technique, we assume that we have two sets of
nodes, $A \subset V$ and $S \subset V$. A is an arbitrary subset of V
(possibly with repetitions). S is an independence sample
(UIS or WIS), with replacement. Our main object
of study is the number $n^{cross}$ of cross-collisions between
S and A, i.e.,

\[ n^{cross} = \sum_{s \in S} \sum_{a \in A} 1_{\{s = a\}}. \]



4.2.1 UIS 



Under UIS, every node in S is selected uniformly at
random from all N nodes. Consequently, the probability
that $s \in S$ collides with a given $a \in A$ is

\[ \Pr(s = a) = \frac{1}{N}. \]

So the expected number of cross-collisions $n^{cross}$ between S and A is

\[ E[n^{cross}] = \sum_{s \in S,\, a \in A} \Pr(s = a) = \frac{|S|\,|A|}{N}. \]

By replacing $E[n^{cross}]$ with the value $n^{cross}$ measured in
reality, we obtain the following size estimator:

\[ \hat{N}_{UIS} = \frac{|S|\,|A|}{n^{cross}}. \qquad (15) \]

This resembles capture-recapture Eq.(1), except that
here only one phase (S) is uniform, and the other one (A)
is arbitrary. Moreover, we allow for repetitions.
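
The cross-collision estimator is short enough to sketch directly. The code assumes the estimate is |S| |A| divided by the number of pairs (s, a) with s = a, counting repetitions on both sides; the function name `ie2_uis` is hypothetical.

```python
from collections import Counter


def ie2_uis(sample, a_nodes):
    """Size estimate |S| * |A| / (number of cross-collisions) under UIS."""
    count_a = Counter(a_nodes)                       # multiplicity of each node in A
    n_cross = sum(count_a[s] for s in sample)        # pairs (s, a) with s == a
    if n_cross == 0:
        return float("inf")
    return len(sample) * len(a_nodes) / n_cross
```

For example, with `sample = [1, 2, 3, 1]` and `a_nodes = [1, 1, 5]` there are four cross-collisions (each of the two occurrences of node 1 in the sample matches both copies in A), giving 4 * 3 / 4 = 3.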

4.2.2 WIS 

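Let us first rewrite Eq.(15) as

\[ \hat{N}_{UIS} = |A| \cdot \frac{\sum_{s \in S} 1}{\sum_{s \in S} \sum_{a \in A} 1_{\{s = a\}}}. \qquad (16) \]

Under WIS, as in Sec. 4.1.2, the application of the
Hansen-Hurwitz estimator to the terms related to S
leads to

\[ \hat{N}_{WIS} = |A| \cdot \frac{\sum_{s \in S} \frac{1}{w(s)}}{\sum_{s \in S} \frac{1}{w(s)} \sum_{a \in A} 1_{\{s = a\}}}. \qquad (17) \]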



4.2.3 How to Choose A? 

Note that in all the derivations above, A is an arbitrary
set (or multiset) of nodes. The only assumption
is that S is drawn independently from A. If we happen
to have such a set A (e.g., from previous measurements
or other sources), then we can employ it with our
estimators. Otherwise (and more conveniently), we can
choose A to be all neighbors of nodes in S, i.e.,

\[ A = \bigcup_{s' \in S} \mathcal{N}(s') \quad \text{(set or multiset).} \qquad (18) \]



With this approach, we obtain an A that is:

• Relatively large. Indeed, $|A| \approx |S| \cdot \langle k \rangle$ under UIS,
and $|A| \approx |S| \cdot \langle k^2 \rangle / \langle k \rangle$ under WIS.

• Generally free, i.e., with no additional sampling
cost. In most graph exploration contexts, we automatically
obtain for every $s \in S$ a list of its neighbors
$\mathcal{N}(s)$, and can thus employ this list without
additional queries.

• Almost independent of S. Clearly, each node $s \in S$
determines the $\mathcal{N}(s)$ that, in turn, is added to A
(and, consequently, s cannot collide with any node
from $\mathcal{N}(s)$). However, s is independent of all the
remaining nodes in A, so the dependence of S on
A quickly diminishes with growing sample size $n = |S|$.

• Of potentially unknown distribution. For example,
when S follows WIS, A depends on graph assortativity
[40]. Similarly, if we discard duplicates,
nodes in A may follow a very complex distribution
[26]. However, none of these is a problem,
because the estimators above accept an arbitrary
set A.

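Moreover, with A selected by Eq.(18) (as a multiset),
every cross-collision between S and A corresponds to an edge
induced on the sampled nodes: $s_i$ collides with an element of
$\mathcal{N}(s_j)$ exactly when $\{s_i, s_j\} \in E$. So, again, we count edges
induced on the sampled nodes S (which explains why
we use IE to refer to this category).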

Set or Multiset? We can either keep potential node 
duplicates in A (i.e., make A a multiset), or discard 
them (i.e., make A a set). Because A can be arbitrary,
both of these approaches work well. However, we found 
in simulations that the latter version sometimes per- 
forms significantly better (especially in highly skewed 
degree distributions), and never worse. For this reason, 
unless explicitly noted, we will henceforth discard all 
duplicates in A. 
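
Building A from the neighbor lists and applying the weighted cross-collision estimator can be sketched as follows. The sketch assumes the WIS estimator has the form |A| * sum(1/w(s)) divided by the sum over s of (number of copies of s in A) / w(s), and that the graph is given as a dictionary `adj` of neighbor sets. The names `neighbor_collection` and `ie2_wis` are hypothetical.

```python
from collections import Counter


def neighbor_collection(sample, adj, as_multiset=False):
    """All neighbors of sampled nodes, with or without duplicates."""
    if as_multiset:
        return [v for s in sample for v in adj[s]]
    return list({v for s in sample for v in adj[s]})


def ie2_wis(sample, a_nodes, weight):
    """Weighted cross-collision size estimate (one-point correction only)."""
    count_a = Counter(a_nodes)
    num = len(a_nodes) * sum(1.0 / weight[s] for s in sample)
    den = sum(count_a[s] / weight[s] for s in sample)
    if den == 0:
        return float("inf")
    return num / den
```

Switching `as_multiset` lets one compare the two options discussed above on the same sample; the text reports that discarding duplicates was never worse in the authors' simulations.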

4.3 IE1 vs. IE2

In all the experiments we conducted, both IE1 and IE2
proved asymptotically unbiased. However, IE2 consistently
performed better than IE1 in terms of variance.
This is because IE1 requires two-point correction: a
single edge $\{u, v\} \in E$ may have substantial weight in
Eq.(14) if w(u) and w(v) are small (i.e., exactly when
u and v are rarely sampled), increasing the variance
of the estimator. In contrast, IE2 uses only one-point
corrections, which makes it more robust.

For this reason, and to improve the paper's readabil- 
ity, we will henceforth use only the IE2 technique, and 
we will refer to it simply as IE. 

5. DEPENDENCE REDUCTION FOR RW 

Both UIS and WIS select nodes independently. In
practice, this can be difficult or impossible to achieve
(see our discussion in Footnote 2). In contrast, one can
often perform a random walk (RW), as commonly done
in WWW [18], P2P networks [12,41,48] and OSNs [11,
21,33,38,42]. In an undirected, connected and aperiodic
graph, RW visits node v at a given step with probability
proportional to its degree deg(v). Therefore, one could
be tempted to set $w(v) = \deg(v)$ and apply directly the
WIS estimators from Sec. 4.

Unfortunately, as we demonstrate in Fig. 2, this approach
fails. Indeed, under RW, the estimate $\hat{N}$ can
be arbitrarily small for small n. This effect fades away
for much larger n, say for $n > N$. However, taking such a
large sample is of course impractical: the central goal
of sampling is to estimate some properties based on a
relatively small sample, i.e., where $n \ll N$.

Figure 2: The relative estimated size $\hat{N}/N$, as a function
of relative sample length n/N, for the p2p-Gnutella31
graph (see Table 3); other graphs yield analogous results.

The WIS estimators fed directly by RW samples perform
poorly because of the strong dependence between
consecutive draws. Assume, for example, that our RW
sample consists of just three nodes, i.e., $S = [s_1, s_2, s_3]$.
Under RW, $s_1$ and $s_3$ collide ($s_1 = s_3$) with probability
equal to $1/\deg(s_2)$. In contrast, under WIS, this
probability may be arbitrarily close to 0 (for $N \to \infty$).
So RW experiences an increased number of collisions $n^{col}$,
which leads to the underestimation of N.
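
For reference, a simple random walk sampler can be sketched as below; it makes the dependence between consecutive draws easy to reproduce. The sketch assumes the graph is given as a dictionary `adj` mapping each node to a list of its neighbors, and the function name `random_walk` is hypothetical.

```python
import random


def random_walk(adj, start, n, rng=None):
    """Return a walk of n nodes; each step picks a uniform random neighbor."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < n:
        # The next node depends only on the current one (Markov chain).
        walk.append(rng.choice(adj[walk[-1]]))
    return walk
```

On a star graph, a walk started at a leaf must visit the center next and then returns to the starting leaf with probability 1/deg(center), which is the three-node effect described above.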

Clearly, in order to apply the WIS graph size estima- 
tors to a RW sample, we have to reduce the dependence 
created by the underlying Markov chain. Below, we de- 
scribe one simple dependence reduction technique used 
in the MCMC literature, and then we propose signifi- 
cantly more efficient techniques. 

5.1 SimpleThinning 

The authors of [19] reduce RW dependence by taking
every $\theta$-th sample from S, where $\theta$ is a thinning parameter.
The resulting subsample

\[ S' = [s_1, s_{1+\theta}, s_{1+2\theta}, \ldots] \qquad (19) \]

is then fed to Eq.(6) to obtain a size estimate.

This approach has several drawbacks. First of all,
a fraction $(\theta - 1)/\theta$ of the samples is dropped, which is a clear waste.
Second, as we will see in Sec. 7, it may be very challenging
(often impossible) to find the optimal value of $\theta$.

5.2 ShiftedThinning 

SimpleThinning can be easily improved by observing
that, rather than one, we obtain $\theta$ different subsamples:

\[ S'_k = [s_{1+k}, s_{1+k+\theta}, s_{1+k+2\theta}, \ldots], \quad k = 0 \ldots \theta - 1. \qquad (20) \]

One way to exploit all these subsamples is to apply
a size estimator to each of them, creating $\theta$ different
estimates $\hat{N}(S'_k)$. We may take the mean or median of
them as our final result, e.g.,

\[ \hat{N}_{aggregated} = \operatorname{mean}_{k}\, \hat{N}(S'_k). \qquad (21) \]

However, this can be problematic, e.g., if for some or
many k we have $\hat{N}(S'_k) = \infty$. Instead, we propose to
aggregate the $\theta$ estimates by applying

\[ \hat{N}_{aggregated} = \frac{\sum_k \mathrm{numerator}(\hat{N}(S'_k))}{\sum_k \mathrm{denominator}(\hat{N}(S'_k))}, \qquad (22) \]

where $\mathrm{numerator}(\hat{N}(S'_k))$ is the numerator of the estimator
$\hat{N}(S'_k)$; analogously for the denominator. This
approach avoids the $\hat{N}(S'_k) = \infty$ problem and performs
(in simulations) consistently better than Eq.(21). We
will henceforth use ShiftedThinning to refer to Eq.(22).
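
Both thinning variants reduce to list slicing, as the following sketch shows. The aggregation sums numerators and denominators across the shifted subsamples before dividing. The collision-based estimator used here (sum over ordered pairs i != j of w(s_i)/w(s_j), divided by twice the collision count) is an assumed form, and all function names are hypothetical.

```python
from collections import Counter


def simple_thinning(walk, theta):
    """Keep every theta-th sample, starting from the first one."""
    return walk[::theta]


def shifted_subsamples(walk, theta):
    """All theta thinned subsamples, one per starting offset k = 0..theta-1."""
    return [walk[k::theta] for k in range(theta)]


def node_wis_parts(sample, weight):
    """(numerator, denominator) of the assumed collision-based estimator."""
    n_col = sum(c * (c - 1) // 2 for c in Counter(sample).values())
    sum_w = sum(weight[s] for s in sample)
    sum_inv_w = sum(1.0 / weight[s] for s in sample)
    return sum_w * sum_inv_w - len(sample), 2 * n_col


def shifted_thinning(walk, theta, weight):
    """Aggregate by dividing summed numerators by summed denominators."""
    parts = [node_wis_parts(sub, weight) for sub in shifted_subsamples(walk, theta)]
    num = sum(p[0] for p in parts)
    den = sum(p[1] for p in parts)
    return float("inf") if den == 0 else num / den
```

Because the division happens only once, a subsample without collisions contributes a zero denominator term instead of an infinite estimate.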



5.3 SafetyMargin 

We propose yet another approach to reduce depen- 
dence in a RW sample. Our main idea is to ignore the 
information brought by pairs of nodes that are less than 
m samples away. This should leave us with pairs of in- 
dependently selected nodes only. To achieve this, we 
must in some cases modify our estimator. 

Applying this idea to NODE (Eq.(6)) is rather straightforward,
and leads to

\[ \hat{N} = \frac{\sum_{i,j} \frac{w(s_i)}{w(s_j)} \cdot 1_{\{|i-j| > m\}}}{\sum_{i,j} 1_{\{s_i = s_j\}} \cdot 1_{\{|i-j| > m\}}}. \qquad (23) \]

In contrast, the IE estimator (Eq.(17) with multiset A)
requires some additional transformations, as follows.
First, note that

\[ \sum_{a \in A} f(a) = \sum_{s \in S} \sum_{a \in \mathcal{N}(s)} f(a), \quad \text{and} \quad |A| = \sum_{s \in S} \deg(s). \]

Consequently, Eq.(17) can be rewritten as

\[ \hat{N} = \frac{\sum_{i,j} \frac{\deg(s_j)}{w(s_i)}}{\sum_{i,j} \frac{1_{\{s_i \in \mathcal{N}(s_j)\}}}{w(s_i)}}. \]

Now, it is easy to exclude the pairs of nodes lying within
m hops, i.e.,

\[ \hat{N} = \frac{\sum_{i,j} \frac{\deg(s_j)}{w(s_i)} \cdot 1_{\{|j-i| > m\}}}{\sum_{i,j} \frac{1_{\{s_i \in \mathcal{N}(s_j)\}}}{w(s_i)} \cdot 1_{\{|j-i| > m\}}}. \qquad (24) \]

Eq.(24) interprets A as a multiset, which is different
from the set version that we suggested in Sec. 4.2.3.
Although we omit it here (for brevity), we implemented
the latter and use it in the evaluation.

Finally, we would like to note that SafetyMargin naturally
fits the sampling strategies using multiple independent
RWs (as, e.g., in [11]). Indeed, it is enough to
replace in Eq.(23) and Eq.(24) every term $1_{\{|j-i| > m\}}$
with $1_{\{walker(i) \neq walker(j)\}}$, where walker(i) is the walker
that contains sample $s_i$. In other words, we consider
only the node pairs where nodes come from different
walks, and are thus independent. Note that the resulting
estimators have no explicit parameter m.
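
The margin idea can be sketched with two naive double loops, one for the collision-based estimator and one for the induced-edge estimator with multiset A. The forms used here are reconstructions: both sums run over ordered pairs (i, j) and keep only pairs with |i - j| > m. The function names are hypothetical, `adj` maps nodes to neighbor sets, and the quadratic loops are for illustration only.

```python
def node_safety_margin(walk, weight, m):
    """Collision-based estimate using only pairs more than m steps apart."""
    n = len(walk)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            if abs(i - j) > m:
                num += weight[walk[i]] / weight[walk[j]]
                if walk[i] == walk[j]:
                    den += 1.0
    return float("inf") if den == 0 else num / den


def ie_safety_margin(walk, adj, weight, m):
    """Induced-edge estimate (multiset A) using only pairs more than m apart."""
    n = len(walk)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            if abs(i - j) > m:
                num += len(adj[walk[j]]) / weight[walk[i]]
                if walk[i] in adj[walk[j]]:
                    den += 1.0 / weight[walk[i]]
    return float("inf") if den == 0 else num / den
```

For a random walk one would typically pass `weight[v] = len(adj[v])`, i.e., degree weights; replacing the test `abs(i - j) > m` by a comparison of walker identifiers gives the multiple-walk variant mentioned above.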

5.4 Comparison 

To compare our dependence reduction techniques, first
note that the main information exploited by our estimators
lies in pairs of sampled nodes. For example, Eq.(6)
uses Eq.(4) that explicitly considers all node pairs and
counts their collisions. Similarly, in the denominator of
Eq.(17) with A constructed by Eq.(18), we count collisions
between a sampled node $s_i$ and the neighbors of
another sampled node $s_j$. Consequently, the efficiency
of an estimator grows with the number of node pairs it
considers.

In the entire sample S, $|S| = n$, we have $n(n-1) \approx n^2$
node pairs (see Footnote 7). UIS and WIS estimators make use of all of
them. In contrast, the RW dependence reduction techniques
proposed in Sec. 5.1 to Sec. 5.3 may significantly
reduce the number of considered pairs, as follows (see
Table 1).

SimpleThinning uses only $n/\theta$ nodes, which results in
$n^2/\theta^2$ node pairs. Analogously, ShiftedThinning uses $n^2/\theta^2$
node pairs for each $k = 0 \ldots \theta - 1$, which results in
the total of $n^2/\theta$ node pairs. Finally, from all $n^2$ node
pairs, SafetyMargin drops 2m pairs in the neighborhood
of each node. Since there are n such nodes, we keep
$n^2 - 2nm = n(n - 2m)$ node pairs.

Both $\theta$ and m signify the same notion: the number of
Markov chain steps such that the dependence between
RW samples becomes negligible. In typical networks
this happens for $\theta\,(= m)$ in the order of tens to hundreds
[38]. Consequently, $n^2/\theta^2 \ll n^2/\theta \ll n(n - 2m)$, and
we may expect the SafetyMargin to perform best.
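
To make the comparison concrete, the three pair counts can be evaluated for given n, thinning parameter and margin. The short sketch below only restates the approximate counts from the text; the function name `node_pairs` and the dictionary keys are choices made for this example.

```python
def node_pairs(n, theta, m):
    """Approximate number of node pairs used by each dependence reduction method."""
    return {
        "SimpleThinning": (n / theta) ** 2,   # one subsample of n/theta nodes
        "ShiftedThinning": n ** 2 / theta,    # theta subsamples of n/theta nodes
        "SafetyMargin": n * (n - 2 * m),      # all pairs except the 2m nearest per node
    }
```

For n = 10000 and theta = m = 100, the counts are 10^4, 10^6 and 9.8 * 10^7 respectively, which illustrates the ordering stated above.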

^7 In this simple calculation, we count separately pairs $(s_i, s_j)$
and $(s_j, s_i)$. Indeed, these two pairs bring different information
to Eq.(17) with Eq.(18).

Dependence Reduction Method    Node Pairs
SimpleThinning                 $n^2/\theta^2$
ShiftedThinning                $n^2/\theta$
SafetyMargin                   $n(n - 2m)$

Table 1: Approximate number of node pairs
exploited by each of the dependence reduction
techniques.

6. IMPLEMENTATION ISSUES 

A straightforward, naive implementation of the above
estimators can easily lead to $O(n^2)$ time complexity,
where $n = |S|$ is the sample size. This is the case, for
example, for the sum

\[ \sum_{i<j} \frac{1}{w(s_i)\, w(s_j)} \]

in Eq.(14). Although not a problem for small samples,
$O(n^2)$ may become an issue, say, for n > 100K. Because
our real-life samples are often significantly larger
(e.g., we sampled millions of Facebook nodes), we had
to look for more efficient implementations of our estimators.
Fortunately, all of them can be rewritten to
use only $O(n)$ time complexity. For example, one can
easily show that the above sum is equal to

\[ \frac{1}{2} \left[ \left( \sum_{i} \frac{1}{w(s_i)} \right)^2 - \sum_{i} \frac{1}{(w(s_i))^2} \right]. \]

Things become more complicated for RW-targeted estimators
in Sec. 5. Here, the corresponding sums are
much more interdependent (especially when we use the
"set" version of Eq.(18)), and thus difficult to separate.
However, even in this case, the time complexity can be
kept linear, with the help of some auxiliary dedicated
data structures.

Our Python implementation available at [23] guarantees
$O(n)$ for all estimators derived in this paper.
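
The linear-time rewriting of the pair sum can be checked numerically. The sketch below assumes the sum runs over pairs i < j, so that it equals half of the squared sum of 1/w minus the sum of 1/w^2. Both function names are hypothetical, and this is not the implementation of [23].

```python
def pair_sum_naive(weights):
    """Sum over pairs i < j of 1 / (w_i * w_j), in O(n^2) time."""
    n = len(weights)
    return sum(
        1.0 / (weights[i] * weights[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def pair_sum_linear(weights):
    """Same quantity in O(n) time, via (sum of 1/w)^2 minus sum of 1/w^2."""
    s1 = sum(1.0 / w for w in weights)
    s2 = sum(1.0 / (w * w) for w in weights)
    return 0.5 * (s1 * s1 - s2)
```

Here `weights` is the list of sampling weights of the sampled nodes in sample order; the two functions agree up to floating-point rounding.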

7. PERFORMANCE EVALUATION 

In this section, we evaluate the NODE and IE estimators
under three sampling techniques: UIS, WIS and
RW. We apply them to a wide spectrum of real-life fully
known topologies (Sec. 7.1) as well as to several samples
of Facebook (Sec. 7.2). Table 2 summarizes the
concrete estimators we used in this study.

7.1 Fully Known Topologies 

We first evaluate our estimators on fully-known topologies,
which allows us to compare the results directly
with the ground-truth graph size. We used 19 real-life
topologies coming from various fields, with up to millions
of nodes and tens of millions of edges. They are
summarized in Table 3.
                       NODE              IE
UIS                    Eq.(5)            Eq.(15)
WIS                    Eq.(6)            Eq.(17)
RW, SimpleThinning     Eq.(6)+Eq.(19)    Eq.(17)+Eq.(19)
RW, ShiftedThinning    Eq.(6)+Eq.(22)    Eq.(17)+Eq.(22)
RW, SafetyMargin       Eq.(23)           Eq.(24)

Table 2: Estimators used in simulations. The
shades of gray correspond to those used in Fig. 3,
Fig. 4 and Fig. 5.



7.1.1 UIS 

In Fig. 3(a), we present the simulation results under
UIS sampling. First of all, we observe that both NODE
and IE converge to the correct value (1.0 on the y-axis)
as the sample size n grows. Second, in all cases, IE
outperforms NODE. For example, for Berkeley13, the IE
estimator with sample size n = 200 performs similarly
to NODE with n = 2000. This means that IE reduces
the sampling cost by 90%, compared to NODE. This
advantage of IE over NODE depends on many factors,
in particular on mean degree. Indeed, all graphs with
high average node degree $\langle k \rangle$ experience several-fold
improvement under IE. In contrast, for graphs with $\langle k \rangle < 5$
(e.g., "email-EuAll" or "roadNet-PA"), the difference
is much less pronounced.

7.1.2 WIS 

Under WIS (see Fig. 3(b)), the efficiency of both
methods improves, especially for sparser topologies. This
is because heterogeneous sampling weights result in more
collisions in the sample, which, in turn, gives the estimators
more information to exploit. (The same phenomenon
has already been observed for NODE in [19].)
However, the relative advantage of IE over NODE remains
roughly the same as under UIS, and is, again,
primarily determined by mean degree.

7.1.3 RW 

In Fig. 4 we present the simulation results for RW
sampling, with two dependence reduction techniques:
Thinning (a) and SafetyMargin (b). For each topology,
we fix the sampling budget, and we vary the thinning
parameter $\theta$ in (a) and the margin m in (b).

Thinning. 

We analyze Thinning in Fig. 4(a). In general, IE
with ShiftedThinning outperforms NODE with ShiftedThinning,
which in turn outperforms NODE with
SimpleThinning. This is in agreement with our analysis
in Sec. 5.4. All versions of thinning follow the same
general pattern with two or three regimes of $\theta$:

1. Underestimation: For $\theta \to 1$, the thinning is too
weak, and the RW dependence results in a systematic
underestimation of size.

2. Flattening (not guaranteed): In some topologies,
for some range of $\theta$, the estimate stabilizes around
the true value, with acceptable variance.

3. Overestimation: For $\theta \to \infty$, we observe no collisions
within the thinned samples and thus our
estimate is often $\hat{N} = \infty$. This effect can be easily
observed for NODE where many plots shoot
upwards for larger $\theta$.

name                         nodes     edges       <deg>   <deg^2>/<deg>
Berkeley13 [49]              22K       852K        74.4    167.0
Texas84 [49]                 36K       1 590K      87.5    212.1
Facebook-New-Orleans [51]    63K       816K        25.8    88.1
livejournal-links [37]       5 189K    48 688K     18.8    155.4
orkut-links [37]             3 072K    117 185K    76.3    390.3
soc-Epinions1 [44]           75K       405K        10.7    183.9
soc-Slashdot0811 [32]        77K       469K        12.1    147.0
youtube-links [37]           1 134K    2 987K      5.3     494.5
email-EuAll [31]             224K      339K        3.0     567.6
flickr-links [37]            1 624K    15 476K     19.0    949.2
wiki-Talk [29]               2 388K    4 656K      3.9     2 705.4
as-skitter [30]              1 694K    11 094K     13.1    1 445.1
cit-Patents [30]             3 764K    16 511K     8.8     21.3
amazon0601 [27]              403K      2 443K      12.1    30.6
as-caida20071105 [30]        26K       53K         4.0     280.2
ca-CondMat [31]              21K       91K         8.5     22.5
p2p-Gnutella31 [31]          62K       147K        4.7     11.6
web-Google [32]              855K      4 291K      10.0    170.4
roadNet-PA [32]              1 087K    1 541K      2.8     3.2

Table 3: Topologies used in offline simulations in
Sec. 7.1. <deg> is the average node degree, <deg^2>
is the average squared node degree. A high value of
<deg^2>/<deg> compared to <deg> indicates a highly
heterogeneous node degree distribution.

Only if Flattening is present (and well pronounced) can
one try to interpret the results and estimate the graph
size. In Fig. 4(a), this is the case, say, for Berkeley13,
Texas84, orkut-links and wiki-Talk under IE (although
this assessment is very subjective in nature). In
all other cases, including all NODE cases, Flattening
does not occur, making them essentially impossible to
interpret.
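
The Underestimation and Overestimation regimes can be
reproduced with a minimal sketch. It assumes that
SimpleThinning keeps every θ-th node of the walk, and it uses
a plain birthday-paradox estimator n(n-1)/(2C), where C is
the number of colliding pairs, as a stand-in for NODE. The
estimators actually used in this paper are defined in earlier
sections and may differ, e.g., by reweighting for non-uniform
sampling. All function names are ours.

```python
import math
from collections import Counter


def collision_pairs(sample):
    """Number of unordered pairs of positions holding the same node."""
    return sum(c * (c - 1) // 2 for c in Counter(sample).values())


def size_from_collisions(sample):
    """Birthday-paradox size estimate for a uniform independence sample
    (a simplified stand-in, not the paper's NODE estimator).
    Returns math.inf when no collision is observed."""
    n = len(sample)
    collisions = collision_pairs(sample)
    if collisions == 0:
        return math.inf
    return n * (n - 1) / (2 * collisions)


def simple_thinning(walk, theta):
    """Keep every theta-th node of the walk (assumed definition)."""
    return walk[::theta]


def thinning_curve(walk, thetas):
    """Size estimate for each thinning parameter in `thetas`."""
    return {t: size_from_collisions(simple_thinning(walk, t)) for t in thetas}
```

For small θ, short-range revisits of the walk inflate the
collision count and pull the estimate down. For large θ, the
thinned sample shrinks to about n/θ nodes and eventually
contains no collisions, at which point the function returns
infinity.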

SafetyMargin.

In contrast, SafetyMargin performs very well, as
shown in Fig. 4(b). Here, we can observe the same
three regimes (Underestimation, Flattening,
Overestimation) as under Thinning. However, now Flattening
is very well pronounced: it spans a wide range of
margins m, yields a relatively small variance, and
concentrates around the true value. This makes the results
much easier to interpret.
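
The role of the margin m can be sketched as follows. We
assume here that SafetyMargin restricts an estimator to pairs
of samples that lie at least m steps apart in the walk; the
precise definition is given in the SafetyMargin section and
is not reproduced here, so this is an illustration rather
than the paper's implementation. The function names and the
toy walk are ours.

```python
from collections import Counter


def margin_collisions(walk, m):
    """Count pairs (i, j) with j - i >= m and walk[i] == walk[j].

    Runs in O(len(walk)) time: when position j is processed, `seen`
    holds the node counts of positions 0 .. j - m only.
    """
    if m < 1:
        raise ValueError("margin m must be at least 1")
    seen = Counter()
    collisions = 0
    for j, node in enumerate(walk):
        if j >= m:
            seen[walk[j - m]] += 1  # position j - m is now far enough
        collisions += seen[node]
    return collisions


def eligible_pairs(n, m):
    """Number of position pairs (i, j) with j - i >= m in a walk of length n."""
    k = n - m
    return k * (k + 1) // 2 if k > 0 else 0


# Hypothetical walk that bounces back and forth before moving on.
walk = [1, 2, 1, 2, 3, 4, 3, 5, 6, 7]
for m in (1, 3, 5):
    print(m, margin_collisions(walk, m), eligible_pairs(len(walk), m))
```

A small margin counts the short-range revisits typical of RW,
which inflates the collision count and pushes the size
estimate down. A larger margin discards them, at the cost of
fewer eligible pairs.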

The only exception is roadNet-PA (last topology),
where all of our RW estimators fail miserably. This is
probably because roadNet-PA represents a road
network, which is typically a lattice-like, almost planar
graph, with a very large diameter (here diam=782).
Consequently, the mixing time of RW (and thus the
desired margin m) is very large, possibly larger than
the sample sizes we tested. Indeed, in the absence
of RW dependence, our estimators perform well, as
presented in Fig. 3(a,b).

[Figure 3 plots omitted. One panel per topology; x-axis:
n/N (log scale). Panels: (a) UIS, (b) WIS. Legend: NODE
(state of the art).]

Figure 3: [UIS and WIS] The estimated size N̂ relative to the real size N for 19 real-life fully known
topologies, as a function of relative sample length n/N. We use two sampling techniques: UIS (a)
and WIS (b) (under WIS, nodes were selected from the stationary distribution of RW). We consider
two estimation techniques: NODE (light grey) and IE (dark grey). For every sample length n, we
performed 500 experiments. The grey regions cover the 500 results from 10th percentile to 90th
percentile, with the median set in dotted line.

[Figure 4 plots omitted. One panel per topology.]

(a) SimpleThinning (Sec. 5.1) and ShiftedThinning (Sec. 5.2). On x-axis, we vary the thinning parameter θ.

(b) SafetyMargin (Sec. 6). On x-axis, we vary the margin m.

Figure 4: [RW] The estimated size N̂ relative to the real size N, for 19 real-life fully known topologies
sampled with RW of relative length n/N given in top left corners. We use two RW dependence
reduction techniques: Thinning (a) and SafetyMargin (b). Under each, we test two estimation
techniques: NODE (light and medium grey) and IE (dark grey). For every topology, we performed
500 experiments. The grey regions cover the 500 results from 10th percentile to 90th percentile, with
the median set in dotted line.

[Figure 5 plots omitted. One panel per topology; x-axis: thinning.]

Figure 5: [RW] NODE with SimpleThinning (state of the art) for RW with ten times larger sampling
budget than in Fig. 4(a,b), in 7 example topologies.

Comparison with State-of-the-art Techniques. 

To date, the state of the art has been NODE with
SimpleThinning [19]. We show its performance in Fig. 5
for RW ten times longer than in Fig. 4(a,b). None of
the presented plots enters the Flattening regime, which
makes the estimation impossible (the same holds for
NODE with ShiftedThinning, not shown). In contrast,
IE with SafetyMargin in Fig. 4(b) performed very well
even for RW samples of 1/10th the length. This means
that, compared to the state of the art, our techniques
achieve here a more than 10-fold reduction in sampling
cost.

Interestingly, a closer comparison of Fig. 3(a) with
Fig. 4(b) reveals that IE with SafetyMargin applied to
RW (thus a highly interdependent and challenging
sample) is often better than NODE applied to UIS
(independence sample).

7.2 Online experiments 

Finally, we test our techniques in online experiments
on Facebook, where the entire topology is unknown to
us. We use samples collected in two different periods of
time:

Facebook'09 [11]: RW and UIS, with 1M users each.
Facebook'10 [24]: RW covering 1M users.

We show the results in Fig. 6.

7.2.1 UIS 

In the top-right plot in Fig. 6 we estimate the size of
Facebook'09 based on a UIS sample of its users [11], as a
function of the sample length n. Both NODE and IE
return values concentrated around N̂ = 240M (already
reported in [19] with NODE), which is in agreement
with what Facebook claimed at that time. However,
the NODE values are much more dispersed than those
of IE. For example, the 10th-90th percentile range of
NODE at n = 100K corresponds to that of IE at less than
n = 10K. This means that with IE we need ten times fewer
samples to achieve NODE's accuracy, which translates into
a 10-fold reduction in sampling cost.

We should note, however, that a UIS sample of nodes
is rarely available. For example, the UIS sample used
above was obtained through rejection sampling of the
entire 32-bit userID space. Soon afterwards, Facebook
moved to a 64-bit space, which makes this approach
completely impractical. In this case, one has to use
other methods, such as RW. Unlike existing methods,
our techniques continue to be useful in this case.
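
The cost of rejection sampling in the two ID spaces can be
sketched as follows. This is an illustration under stated
assumptions, not the collection procedure of [11]: we assume
candidate IDs are drawn uniformly from the whole ID space,
and `user_exists` is a hypothetical callable standing in for
whatever lookup reveals whether an ID belongs to a real user.

```python
import random


def acceptance_rate(num_users, id_bits):
    """Fraction of a 2**id_bits ID space occupied by valid user IDs."""
    return num_users / 2 ** id_bits


def rejection_sample(user_exists, id_bits, n, rng, max_tries=1_000_000):
    """Draw up to n uniform user IDs (with replacement) by rejection.

    `user_exists` is a hypothetical lookup, not a real API.
    Returns the accepted IDs and the number of candidates tried.
    """
    accepted, tries = [], 0
    while len(accepted) < n and tries < max_tries:
        tries += 1
        candidate = rng.getrandbits(id_bits)
        if user_exists(candidate):
            accepted.append(candidate)
    return accepted, tries


# Using the figure of about 240M users quoted above.
print(acceptance_rate(240e6, 32))
print(acceptance_rate(240e6, 64))
```

With about 240M users, roughly 5.6% of candidates would be
accepted in a 32-bit space, against roughly 1.3 x 10^-11 in
a 64-bit space.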

7.2.2 RW 

All the remaining plots in Fig. 6 are generated based
on RW samples of Facebook nodes, with different
dependence reduction techniques.

The left-most column uses Thinning. Similarly to
(most of) Fig. 4(a), the estimates do not stabilize with
the thinning parameter θ, which makes the results
practically impossible to interpret.

In contrast, SafetyMargin applied to the same RW
samples (second column in Fig. 6) performs very well
and leads to good and concentrated size estimates.

The third column of Fig. 6 is our attempt to compare
the efficiency of our estimators. To this end, we applied
NODE with Thinning to the entire RW sample (with
1M = 1 million nodes), which resulted in a single
estimate per θ, represented by the light- and medium-grey
lines. Next, we applied IE with SafetyMargin to one
hundred 10K-long chunks of our RW sample (dark-grey
region). In both datasets, the state-of-the-art solution,
i.e., NODE with SimpleThinning, performed badly.* In
contrast, IE with SafetyMargin leads to very reasonable
estimates. Because the latter uses 100 times fewer node
samples, we conclude that in RW sampling of
Facebook, our techniques lead to at least a 100-fold
reduction in sampling cost.
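
The chunk-based comparison can be sketched as follows. The
callable `estimate_size` is a placeholder for any size
estimator (e.g., IE with SafetyMargin) and is not defined
here; the function names are ours, and the percentile
convention of `statistics.quantiles` may differ slightly
from the one used for the plots.

```python
import statistics


def chunked_estimates(sample, chunk_len, estimate_size):
    """Apply `estimate_size` to consecutive, non-overlapping chunks of
    `sample`; a trailing chunk shorter than chunk_len is dropped."""
    last_start = len(sample) - chunk_len
    return [estimate_size(sample[i:i + chunk_len])
            for i in range(0, last_start + 1, chunk_len)]


def summarize(estimates):
    """Return (10th percentile, median, 90th percentile).
    Assumes at least two finite estimates."""
    deciles = statistics.quantiles(estimates, n=10)
    return deciles[0], statistics.median(estimates), deciles[-1]
```

For a walk of 1M nodes and chunk_len = 10 000, this yields
one hundred estimates, whose 10th-90th percentile range and
median can then be plotted as a region and a dotted line.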

* Although ShiftedThinning improves the results, they are
still impossible to interpret, especially under Facebook'10.

Figure 6: Online experiments on Facebook'09 (top) and Facebook'10 (bottom). We use RW and UIS
(for '09 only) sampling techniques, and consider two size estimators: NODE (light and medium grey)
and IE (dark grey). The grey regions cover the results from 10th percentile to 90th percentile,
with the median set in dotted line.

8. CONCLUSION AND FUTURE WORK

In this paper, we began by introducing IE, an
efficient technique to estimate the size of a graph, based on
an independence sample (uniform or not) of its nodes.
In many practical applications, however, independence
sampling is not possible, but it is relatively easy to
perform a Random Walk (RW) in the graph. Because of
the strong dependence between consecutive nodes in an
RW sample, neither standard estimators nor IE can use
such data without adjustment. To address this
problem, we introduced SafetyMargin, a technique that
corrects the estimators for dependence in RW samples, and
is applicable to both IE and existing estimation
methods.

We evaluated our techniques in simulations on a wide
range of fully known real-life topologies, and on
several samples of Facebook (confirming the officially
announced number of users). We found that, for the same
estimation error, IE with SafetyMargin often requires
10+ times fewer samples than the state-of-the-art
solutions. In particular, for Facebook, we observed a more
than 100-fold reduction in sampling cost.

A Python implementation of all estimators used in
this paper, optimized to guarantee O(n) time
complexity, is available at [23].

In future work, we plan to study ideas that can
further improve the efficiency of our estimators. For
example, one can try to use all neighbors of the sampled
nodes, rather than the sampled neighbors only, or to
combine NODE and IE together. Another challenge is
to extend these results to directed graphs, for which
RW sampling weights are harder to obtain than in the
undirected case; the theory here is unchanged, but the
challenges associated with RW sampling per se are
significant. Finally, we plan to study how SafetyMargin
applies to other problems, e.g., to the estimation of
graph clustering and assortativity.
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